Abstract. The problem of automatically extracting co-referential objects
from multi-source Chinese POI datasets has not been fully resolved: recall
and accuracy remain low because of position deviation and confusion between
names and addresses. This paper proposes an automatic method for extracting
co-referential objects from multi-source POI datasets based on position
correction and semantic matching. The main steps of the method are:
(1) semantically selecting entities with the same name and the same address,
(2) applying local position correction, based on triangulation, to the POI
dataset to be merged, and (3) extracting the remaining co-referential
entities with the Nearest Neighbor Method and semantic comparison.
Experimental results show that the method performs well in both recall and
precision.
1. Introduction

The accuracy and currency of POI (point of interest) data are crucial for
the availability of geographic information services, so POI databases must
be enriched and updated continually. At present, POI data are updated mainly
by manual re-collection, and the larger the volume of data, the higher the
maintenance cost. At the same time, POI information on the Internet is
increasingly rich and can serve as an important source for updating POI
data. It is therefore necessary to study the automatic fusion and updating
of multi-source POI data.

Automatically extracting co-referential objects (records that refer to the
same geographic target in the real world) is the basis of multi-source POI
data fusion. POI data from different sources are often difficult to fuse
directly, because there are no unified rules for entity naming, address
coding or attribute naming, and because different service providers
inevitably introduce errors when they acquire and process the positional
information of POIs. The difficulties appear mainly in the following
respects:

• The coordinates of a co-referential object in different POI datasets
deviate nonlinearly, so the positions cannot be aligned directly.

• The naming of a co-referential object is not identical in different POI
datasets.

• The address representation of a co-referential object is not identical in
different POI datasets.

Therefore, this paper proposes a method for automatically extracting
co-referential objects from multi-source POI datasets based on position
correction and semantic matching. It addresses the low automatic matching
rate and accuracy for co-referential objects caused by differences in
position, name and address.



2. Related technologies 

2.1. Semantic matching of entity's name 

The aim of semantic matching of entity names is to calculate the similarity
between the names of entities. Name matching for English is already
relatively mature, but for Chinese the problem is still not completely
resolved because of structural instability, complex nested relationships,
the lack of indicator words and so on. At present, matching methods for
Chinese entity names focus mainly on matching the key words contained in
the names. Liu X proposed a Lucene-based segmentation and matching method
for Chinese POI names, which realizes fuzzy matching according to the
different roles of the segmentation units of a POI name. Zhang X achieved
the automatic identification of Chinese organization names by analyzing
their structure. Li J presented a method of identifying Chinese organization
names based on template matching of the unknown words they contain. Yu H
proposed a method of Chinese organization name identification based on role
labeling.



2.2. Geocoding 

Geocoding is the process of associating address information that expresses
a spatial location with actual spatial coordinates; that is, address data
are mapped to geographical coordinates. It is a spatial positioning
technology based on the text of an address. The general method is as
follows: split the address string to be geocoded; standardize the address
according to the address model; match the key fields of the standardized
address against the corresponding attribute values of the geographical
entities in the spatial reference data; and finally take the geographic
coordinates of the matching result and assign them to the corresponding
attribute, which completes the coding of the address. Geocoding software
tools for English are mature, but those for Chinese are still at a
preliminary and exploratory stage.

2.3. Position-correction 

The main idea of position correction is to establish a mapping between
corresponding points in two different spaces. With such a mapping,
coordinates can be converted between two coordinate systems, and regular
deformation errors introduced during acquisition can be removed or
weakened, which provides technical support for multi-source data. The
common method for transforming the spatial positions of POI data is to
establish an affine, similarity or projective coordinate transformation
model (as in remote sensing image correction in photogrammetry),
automatically select a series of homologous points (points with the same
name) from the different data sources and coordinate systems, calculate the
parameters of the transformation model using the least squares method, and
then apply the geometric transformation to the map data.

2.4. Matching of geographic entity 

Geographic entity matching aims to identify the same feature in two
different datasets. Many elements can be used for matching, such as
location, shape, structure, topology, names and attributes. According to
the elements used, matching methods can be divided into geometric matching,
topological matching and semantic matching. Geometric matching selects
candidate entities from their spatial attributes by calculating spatial
distance. Topological matching is based on measuring the topological
relations of the candidate entities. Semantic matching compares the semantic
information of the candidate entities, such as their names and addresses.



3. The automatic extracting method of multi-source 
POI co-referential object 

This paper presents an automatic method for extracting multi-source POI
co-referential objects that combines semantic analysis, geocoding and
adjustment processing. It is shown in Figure 1 and can be described as
follows. First, entities with the same name and address are selected through
address standardization and similarity matching. Second, position
correction is carried out on the basis of the POI set with the same name and
address, and the corrected data are then evaluated jointly on spatial
distance and semantic similarity. Finally, the co-referential objects are
extracted from the POI datasets.
Figure 1. The workflow of extracting multi-source POI co-referential objects.



3.1. Semantically selecting the entity of same name and address

3.1.1. Address standardization and semantic matching

Drawing on other geocoding models such as the DIME model, the TIGER model
and the ESRI model, this paper designs a hierarchical geocoding model that
incorporates Chinese specifications (Figure 2). It consists mainly of three
parts: administrative division, address and sub-address. The administrative
division includes the country name, province name, city name, area name and
county name. The address includes a fundamental part and an extension part.
The sub-address likewise includes a fundamental part and an extension part.
Figure 2. Chinese geocoding model.

According to the designed geocoding model, a standard address is expressed
as:

<standard address> ::= <[country name] [province name] [city name]
[district/county name]> <[fundamental address] [extended address]>
<[fundamental sub-address] [extended sub-address]>

Address standardization expresses any address in the standardized form and
automatically fills in missing higher-level elements according to the
affiliation between elements. The result is a standardized address
expression that ends at the lowest-level address element available.

Standardized addresses are matched layer by layer, from the top level down.
The semantic matching results fall into three cases:

• Exactly matching: after standardization, the number and the names of the
element layers are completely the same.

• Compatibly matching: all elements in every shared layer match
successfully, but one of the standardized addresses still has subordinate
address elements. The two addresses differ in precision but are mutually
compatible.

• Mismatching: the address elements of the standardized addresses cannot be
matched layer by layer.
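
The three cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each standardized address is assumed to be a list of elements ordered from the highest level to the lowest, the function name `match_addresses` is hypothetical, and the sample elements are placeholders rather than data from the experiment.

```python
def match_addresses(addr_a, addr_b):
    """Classify two standardized addresses layer by layer.

    Each address is assumed to be a list of elements ordered from the
    highest administrative level down to the lowest available element.
    """
    # Compare only the layers that both addresses have (zip stops at the shorter).
    for elem_a, elem_b in zip(addr_a, addr_b):
        if elem_a != elem_b:
            return "mismatch"
    # All shared layers agree: equal depth means exact, otherwise compatible.
    if len(addr_a) == len(addr_b):
        return "exact"
    return "compatible"


# Placeholder addresses, highest level first.
full = ["China", "Beijing", "Haidian", "Road X", "No. 5"]
coarse = ["China", "Beijing", "Haidian", "Road X"]
other = ["China", "Beijing", "Haidian", "Road Y", "No. 5"]
print(match_addresses(full, full), match_addresses(full, coarse),
      match_addresses(full, other))
```

The sketch compares elements by plain string equality; a real implementation would first need the standardization step to normalize spellings and abbreviations.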

3.1.2. POI name matching with similarity calculation



POI name matching is the process of deciding, from the text names of two
POIs, whether they carry the same name. In a Chinese POI name, a word can
play different roles depending on its position, so different words
contribute differently to the matching of POI names.

Therefore, we propose a POI name matching method based on role tagging,
which is shown in Figure 3. First, the words in a POI name are tagged with
roles using phrase segmentation and a word dictionary. Second, the central
word of the POI name is extracted, and the name is then cut around the
central word. Finally, the similarity of the POI names is calculated from
the results of the cutting.
Figure 3. The process of POI name matching.

The similarity calculation formula of POI name matching is as follows:

$$P(a,b) = \lambda \cdot P(w_{a\text{-}center}, w_{b\text{-}center}) + \beta_1 \cdot P(w_{a\text{-}1}, w_{b\text{-}1}) + \beta_2 \cdot P(w_{a\text{-}2}, w_{b\text{-}2}) + \cdots + \beta_i \cdot P(w_{a\text{-}i}, w_{b\text{-}i}) \quad (1)$$

$P(a,b)$ indicates the probability that $a$ and $b$ are the same POI,
$w_{a\text{-}center}$ indicates the central word of $a$, $w_{a\text{-}i}$
indicates the i-th modifier of $a$, and $\lambda, \beta_1, \beta_2, \ldots,
\beta_i$ indicate parameters in $[0,1]$ and …
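
A minimal sketch of equation (1) follows. The paper does not specify how the word-level similarity is computed, so the sketch assumes a simple exact-equality indicator; the function names, the weight values and the sample words are all hypothetical.

```python
def word_similarity(w_a, w_b):
    # Assumption: 1.0 for identical words, 0.0 otherwise. The word-level
    # measure used in the paper is not specified.
    return 1.0 if w_a == w_b else 0.0


def name_similarity(center_a, mods_a, center_b, mods_b, center_weight, mod_weights):
    """Weighted sum of central-word and modifier similarities, as in equation (1)."""
    score = center_weight * word_similarity(center_a, center_b)
    # Modifiers are paired by position; extra modifiers on either side are ignored.
    for weight, m_a, m_b in zip(mod_weights, mods_a, mods_b):
        score += weight * word_similarity(m_a, m_b)
    return score


# Hypothetical weights: 0.5 for the central word, 0.3 and 0.2 for two modifiers.
score = name_similarity("hotel", ["lianhua", "business"],
                        "hotel", ["lianhua", "express"],
                        0.5, [0.3, 0.2])
print(score)
```

Here the central word and the first modifier agree while the second modifier differs, so only the first two weights contribute to the score. The result would then be compared against thresholds to decide between the matching cases.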

The results of name matching can be divided into three cases:

• Exactly matching: the center words and all modifiers of the POI names are
exactly the same, or the center word is the same and most of the modifiers
are the same, so that the matching similarity is greater than a specific
threshold.

• Type matching: the center words of the POI names are the same or belong
to the same class, and most of the modifiers are the same, so that the
matching similarity exceeds a specific threshold.

• Mismatching: the center words of the POI names do not belong to the same
class and most modifiers are different, so that the matching similarity is
lower than a specific threshold.






3.1.3. Selection of POI objects with the same name and address



Combining the results of POI address standardization in Section 3.1.1 with
the results of POI name matching in Section 3.1.2 gives a matrix of matching
results for POI name and address, as shown in Table 1.



name \ address       Exactly matching   Compatibly matching   Mismatching
Exactly matching     M11                M12                   M13
Type matching        M21                M22                   M23
Mismatching          M31                M32                   M33

Table 1. The matching results of POI object name and address.



• M11 indicates the object set in which both the name and the address of
the POI are exactly matched.

• M12 indicates the object set in which the name is exactly matched and
the address is compatibly matched.

• M13 indicates the object set in which the name is exactly matched and
the address is not matched.

• M21 indicates the object set in which the name is type matched and the
address is exactly matched.

• M22 indicates the object set in which the name is type matched and the
address is compatibly matched.

• M23 indicates the object set in which the name is type matched but the
address is not matched.

• M31 indicates the object set in which the name is not matched but the
address is exactly matched.

• M32 indicates the object set in which the name is not matched but the
address is compatibly matched.

• M33 indicates the object set in which neither the name nor the address
of the POI is matched.

POI objects with the same name and address are selected from the combined
results of name matching and address matching: the POI pairs in set M11 are
taken as the entities with the same name and the same address.
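
The combination in Table 1 amounts to a lookup on two categorical results. The sketch below assumes that each candidate pair already carries a name result and an address result; the labels, the function names and the sample pairs are hypothetical.

```python
# Row index comes from the name result, column index from the address result.
NAME_ROWS = {"exact": 1, "type": 2, "mismatch": 3}
ADDRESS_COLS = {"exact": 1, "compatible": 2, "mismatch": 3}


def matrix_cell(name_result, address_result):
    """Return the Table 1 cell label, e.g. 'M12'."""
    return f"M{NAME_ROWS[name_result]}{ADDRESS_COLS[address_result]}"


def select_same_name_and_address(pairs):
    """Keep only the pairs that fall into M11.

    pairs: iterable of (poi_a, poi_b, name_result, address_result).
    """
    return [(a, b) for a, b, name_res, addr_res in pairs
            if matrix_cell(name_res, addr_res) == "M11"]


# Hypothetical candidate pairs identified by placeholder ids.
pairs = [("a1", "b7", "exact", "exact"),
         ("a2", "b3", "exact", "compatible"),
         ("a5", "b9", "type", "exact")]
print(select_same_name_and_address(pairs))
```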

3.2. Position-correction 

Positional deviation inevitably exists among different POI datasets, so
position correction is needed before fusion to reduce the deviation between
the two sets of POI data. In most cases the deviation between different POI
datasets is nonlinear, so a single linear transformation function cannot be
used for the conversion. This paper presents a processing method that
applies local linear transformations based on triangulation.



In Section 3.1, the entities with the same name and the same address (Set M)
were extracted. We now build a Delaunay triangulation over the spatial
extent of Set B, using the entities in Set M as the vertices, which forms
the set of triangular meshes T (Set T) shown in Figure 4.
Figure 4. Triangulation of POI set B using the points in Set M.

In each triangle $T_i$ in Set T, the coordinate transformation B→A is
assumed to be a local linear transformation with parameters
$D_i = (a_i, b_i, m_i, n_i)$. The corrective function is given as follows:

$$X_a = a_i X_b + b_i \quad (2)$$
$$Y_a = m_i Y_b + n_i \quad (3)$$

To obtain the transformation parameters $D_i = (a_i, b_i, m_i, n_i)$, the
least squares (LS) method is used. The positions of the objects in POI sets
A and B that correspond to the three vertices of $T_i$ need to satisfy the
following conditions, where the sums run over the matched points
$k = 0, \ldots, n$:

$$M(a_i, b_i) = \sum_{k=0}^{n} (X_a - a_i X_b - b_i)^2 = \min \quad (4)$$

$$M(m_i, n_i) = \sum_{k=0}^{n} (Y_a - m_i Y_b - n_i)^2 = \min \quad (5)$$

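The calculating formula of $a_i$ and $b_i$ is:

$$\begin{cases}
\left(\sum_{k=0}^{n} X_b^2\right) a_i + \left(\sum_{k=0}^{n} X_b\right) b_i = \sum_{k=0}^{n} X_a X_b \\
\left(\sum_{k=0}^{n} X_b\right) a_i + (n+1)\, b_i = \sum_{k=0}^{n} X_a
\end{cases} \quad (6)$$

The calculating formula of $m_i$ and $n_i$ is:

$$\begin{cases}
\left(\sum_{k=0}^{n} Y_b^2\right) m_i + \left(\sum_{k=0}^{n} Y_b\right) n_i = \sum_{k=0}^{n} Y_a Y_b \\
\left(\sum_{k=0}^{n} Y_b\right) m_i + (n+1)\, n_i = \sum_{k=0}^{n} Y_a
\end{cases} \quad (7)$$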
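
The normal equations for one axis form a 2x2 linear system that can be solved in closed form. The following is a minimal sketch, not the authors' implementation; the function name `fit_axis` and the sample coordinates are hypothetical. The same function serves for the X axis and for the Y axis.

```python
def fit_axis(src, dst):
    """Least-squares fit of dst ~= scale * src + offset for one axis.

    src holds the coordinates in set B and dst the matching coordinates
    in set A. The 2x2 normal equations are solved by Cramer's rule.
    """
    count = len(src)                       # n + 1 in the paper's notation
    s_x = sum(src)
    s_xx = sum(x * x for x in src)
    s_y = sum(dst)
    s_xy = sum(x * y for x, y in zip(src, dst))
    det = count * s_xx - s_x * s_x
    if det == 0:
        # All source coordinates are equal, so no unique line exists.
        raise ValueError("degenerate input: source coordinates are identical")
    scale = (count * s_xy - s_x * s_y) / det
    offset = (s_xx * s_y - s_x * s_xy) / det
    return scale, offset


# Three hypothetical vertex coordinates that lie exactly on dst = 2 * src + 1.
scale, offset = fit_axis([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
print(scale, offset)
```

With three vertices per triangle and two unknowns per axis, the system is overdetermined, which is why a least-squares solution is used rather than an exact one.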

Once the local transformation parameters have been obtained for each
triangle in Set T, each triangle (Ti ∈ T) is enumerated, and the position of
every POI object inside it is recalculated using equations (2) and (3) to
obtain the new POI set Bt. Pseudocode in C# style is listed in Table 2:



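List<POI> Bt = new List<POI>();
foreach (Triangle t in T) {
    // a_i, b_i, m_i, n_i: local transformation parameters of triangle t
    foreach (POI p in t) {
        POI pn = new POI();
        pn.X = a_i * p.X + b_i;   // equation (2)
        pn.Y = m_i * p.Y + n_i;   // equation (3)
        Bt.Add(pn);               // collect every corrected POI
    }
}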



Table 2. Each triangle (Ti ∈ T) is enumerated to get the new POI set Bt.
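
The enumeration can also be sketched in runnable form. The sketch below assumes that each triangle is given as three (x, y) vertices in the coordinates of set B together with its already fitted parameters (a, b, m, n), and that POIs falling outside every triangle are left unchanged; that last rule, the function names and the sample values are assumptions, not part of the paper.

```python
def _cross(o, p, q):
    # z-component of the cross product (p - o) x (q - o)
    return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])


def contains(tri, pt):
    """True if pt lies inside or on the boundary of triangle tri."""
    d1 = _cross(tri[0], tri[1], pt)
    d2 = _cross(tri[1], tri[2], pt)
    d3 = _cross(tri[2], tri[0], pt)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)


def correct_positions(points_b, triangles):
    """Apply each triangle's local linear correction to the points inside it.

    triangles: list of (vertices, (a, b, m, n)).
    """
    corrected = []
    for x, y in points_b:
        for vertices, (a, b, m, n) in triangles:
            if contains(vertices, (x, y)):
                x, y = a * x + b, m * y + n   # equations (2) and (3)
                break
        corrected.append((x, y))
    return corrected


# One hypothetical triangle with a pure shift as its local transformation.
tri = (((0.0, 0.0), (4.0, 0.0), (0.0, 4.0)), (1.0, 0.5, 1.0, -0.5))
bt = correct_positions([(1.0, 1.0), (10.0, 10.0)], [tri])
print(bt)
```

Locating the enclosing triangle by a linear scan is enough for a small test area; a spatial index would be the natural replacement for larger datasets.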

3.3. Automatic calculation of co-referential object 

After the position correction of Section 3.2, the One-Side Nearest Neighbor
Method is used to find the co-referential objects that were missed in
Section 3.1. Given a distance threshold, for each point Ai in Set A the
points in Set Bt that lie within the threshold are selected as
co-referential candidates.
Figure 5. Use one-side nearest neighbor method to calculate co-referential object.
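
A minimal sketch of the candidate search follows. It assumes planar coordinates in metres and Euclidean distance, which the paper does not state explicitly; the function names and the sample points are hypothetical.

```python
import math


def candidates_within(point_a, points_bt, threshold):
    """Return (distance, index) pairs of Bt points within threshold, nearest first."""
    hits = []
    for idx, (x, y) in enumerate(points_bt):
        dist = math.hypot(x - point_a[0], y - point_a[1])
        if dist <= threshold:
            hits.append((dist, idx))
    hits.sort()
    return hits


def one_side_nearest(points_a, points_bt, threshold):
    """For each point of A, the index of its nearest Bt point, or None."""
    result = []
    for point_a in points_a:
        hits = candidates_within(point_a, points_bt, threshold)
        result.append(hits[0][1] if hits else None)
    return result


# Hypothetical planar coordinates in metres.
set_a = [(0.0, 0.0), (100.0, 100.0)]
set_bt = [(3.0, 4.0), (0.0, 8.0), (50.0, 50.0)]
print(candidates_within(set_a[0], set_bt, 10.0))
print(one_side_nearest(set_a, set_bt, 10.0))
```

The search is one-sided because only the points of A look for neighbours in Bt; no check is made that the chosen Bt point would also pick the same A point.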

From the combination of name matching and address matching in Table 1, the
remaining co-referential objects can only lie in the object sets M12, M21
and M22. The pairs in M13, M23, M31, M32 and M33 are not the same POI
objects and cannot be co-referential. Therefore, co-referential objects are
searched for only in M12, M21 and M22.

For choosing among candidate co-referential objects, this paper defines the
priority order of M12, M21 and M22 as M12 > M21 > M22. The reason is as
follows. The granularity of user-labelled addresses is often not the same,
so the probability that two addresses have the same precision and are
exactly equal is small; when the address match is compatible and the name
match is exact, objects in M12 are therefore selected first. M21 may
contain different objects with the same address: for example, different POI
objects in one building have exactly the same address but are not the same
object. This situation is relatively common in the real world, so the
probability that a pair in M21 is co-referential is lower than in M12 but
higher than in M22.
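
The priority rule can be sketched as a sort key. Breaking ties by distance within one set is an assumption added for the sketch, and the function name, tuple layout and sample values are hypothetical.

```python
# Lower value means higher priority: M12 > M21 > M22.
PRIORITY = {"M12": 0, "M21": 1, "M22": 2}


def pick_coreferent(candidates):
    """Choose one candidate by set priority, then by distance.

    candidates: list of (poi_id, cell, distance), where cell is the
    Table 1 label of the pair. Returns the chosen poi_id or None.
    """
    eligible = [c for c in candidates if c[1] in PRIORITY]
    if not eligible:
        return None
    best = min(eligible, key=lambda c: (PRIORITY[c[1]], c[2]))
    return best[0]


# Hypothetical candidates for one POI of set A.
cands = [("b4", "M22", 2.0), ("b9", "M12", 7.5), ("b2", "M21", 1.0), ("b6", "M33", 0.5)]
print(pick_coreferent(cands))
```

In this example the M12 candidate wins even though it is the farthest eligible point, because set priority is applied before distance.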



4. Experiment and discussion 
4.1. Experimental data and test results 

For this paper, we manually collected POI objects in the region of Lianhua
Bridge in Haidian District, Beijing, China (W 116.304997, S 39.888172,
E 116.335521, N 39.905651) from the maps at www.mapbar.com and
map.baidu.com. There are 219 POIs from MapABC (left panel of Figure 6) and
318 POIs from baidu (right panel of Figure 6).




Figure 6. POI distribution map in experimental zone (left: mapbar, right: baidu).

First, we manually checked each POI in the two datasets to obtain the
number of co-referential entities (143 in total).



Next, a simple text matching algorithm and the name-address standardization
algorithm were each used to find entities with the same name and the same
address. The name-address standardization method found 25 objects, while
the simple text matching method found only 7.

After that, the Nearest Neighbor Method was used to find the rest of the
co-referential objects. First, local position correction based on
triangulation and global position correction were each applied to the POIs
from baidu.com, which gave two sets of "position-corrected" POIs. Then the
Nearest Neighbor Method (threshold = 10 meters) was applied to the raw sets
and to the two "position-corrected" sets to calculate co-referential
objects. The results are shown in Table 3.



Position correction                                Returned results   Correct results   Recall rate   Accuracy rate
None                                               54                 32                27.1%         59.3%
Global position correction                         68                 37                31.4%         54.4%
Local position correction based on triangulation   102                83                70.3%         81.4%

Table 3. Nearest Neighbor Method calculation results.



Summarizing the results of all steps gives Table 4. It shows that the method
in this paper performs best, with recall and accuracy rates of 75.5% and
85.0% respectively.



Position correction                                Total co-referential objects   Correct co-referential objects   Recall rate   Accuracy rate
None                                               79                             57                               39.9%         72.2%
Global position correction                         93                             62                               43.4%         66.7%
Local position correction based on triangulation   127                            108                              75.5%         85.0%

Table 4. Algorithmic effect.
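
The reported rates can be reproduced from the counts with a short calculation, taking recall as correct results over true co-referential entities and accuracy as correct results over returned results. The denominators are an inference from the figures rather than a statement in the paper: the overall rates are consistent with all 143 co-referential entities, and the rates for the Nearest Neighbor step alone are consistent with the 118 entities left after the 25 same-name, same-address objects are set aside. The function name `rates` is hypothetical.

```python
def rates(returned, correct, total_true):
    """Return (recall %, accuracy %) rounded to one decimal place."""
    recall = 100.0 * correct / total_true
    accuracy = 100.0 * correct / returned
    return round(recall, 1), round(accuracy, 1)


# Overall result with local position correction: 127 returned, 108 correct,
# 143 true co-referential entities.
overall_local = rates(127, 108, 143)

# Nearest Neighbor step alone with local position correction: 102 returned,
# 83 correct, 143 - 25 = 118 entities still to be found.
nn_local = rates(102, 83, 143 - 25)

print(overall_local, nn_local)
```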



4.2. Analysis and Discussion 

The method proposed in this paper, based on position correction and
semantic matching, can effectively reduce the location deviation between
POIs from different sources and standardize the name, address and other
information of POIs. The experiment shows that it performs well in both
recall and precision and is a good method for finding co-referential
entities in multi-source POI data. It is particularly applicable to the
fusion of POI data from different geographic information websites.

The proposed method can be further enhanced in the following ways:
establishing a stable triangulation model, doing more research on
calculating and merging the attributes of POI entities, and improving the
effectiveness and performance of mass data processing.